-
Notifications
You must be signed in to change notification settings - Fork 264
TEXT-126: Adding Sorensen-Dice similarity algoritham #103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/test/java/org/apache/commons/text/similarity/SorensenDicesSimilarityTest.java
Outdated
Show resolved
Hide resolved
Hi @ameyjadiye ! A new metric! 🎉 Added a few comments, but only spent a couple of minutes looking at the code. I will take a better look later, and read about this similarity (never heard about it). Thanks !!! |
Hi @ameyjadiye. Thanks for the contribution. I've had a look at this similarity as it is essentially a binary scoring metric: Thus you must compute the matches (true-positives) between the two sets
Or simply:
with The score is related to the The way you have defined the score is to take each unique character pair (bigram) as the set from each string. Thus:
The original article you referenced includes duplicate observations of each unique character pair (bigram). Thus:
This requires a different algorithm without using Q. Do you know if there is a preference in the field for parsing unique bigrams or including duplicates? Either way it should be clear in the Javadoc what the intension is. I also note that splitting the
You can then do what you want with these including writing an algorithm that sorts the bigrams for efficient comparison, or use them in your current algorithm with Also be careful of overflow in the result computation. I acknowledge that with the |
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/test/java/org/apache/commons/text/similarity/SorensenDicesSimilarityTest.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finished first round of reviews @ameyjadiye ! Great work, and thanks for the links. Today-I-Learned about the Sorensen-Dice similarity. Hadn't heard about it. However, in NLP F1-Score is quite common, and looks like it is a - if not the same? - Sorensen-Dice coefficient. Will read more later. Cheers
src/main/java/org/apache/commons/text/similarity/SorensenDicesSimilarity.java
Outdated
Show resolved
Hide resolved
Hi @kinow , looks like all suggestions are accommodated with readable code and considerable performance. |
Hi @ameyjadiye, The code is now a good working implementation. However it should be noted that the Sorensen-Dice similarity is a binary scoring metric. It can be applied to anything where you have two sets to be matched. It is closely related to the Jaccard score. This PR brings in a new algorithm to compute matches using pairs of characters (bigrams). This is a conflict with existing similarity measures in the package implementing JaccardSimilarity - uses The rest of the package is for an extension of CosineDistance - uses a So all the other measures use single characters or words (CosineDistance). If you want a Sorensen-Dice distance score using single characters then just use the This class is effectively a JaccardDistance with One suggestion would be a change to be something like However the current JaccardDistance uses Set which has space and efficiency advantages over Set. Once you are using strings then there is no reason to have only bigrams. My preference would be to add a new class Your Then you add a @kinow Opinions on a more flexible approach? |
I've had more of a think and have mocked an implementation for a generic See TEXT-155 and the code in the linked PR. This class can be used to compute the Sorensen-Dice similarity for bigrams or any other type of structure that you would like to analyse from the Note: Sorensen-Dice similarity is the same as the F1-score |
Been busy with a few other bugs these past two days, but hopefully will be able to review this and TEXT-155 this weekend 🎉 thanks a lot @aherbert ! @ameyjadiye sorry the delay, just a bit longer and we should be done with the review for your contribution (: |
I have just noticed that the |
src/main/java/org/apache/commons/text/similarity/SorensenDiceSimilarity.java
Show resolved
Hide resolved
https://github.com/aherbert/commons-text into feature-TEXT-155 # Conflicts: # src/test/java/org/apache/commons/text/similarity/IntersectionSimilarityTest.java
This adds Bag functionality to allow counting duplicate elements.
@ameyjadiye see last comment from @aherbert about empty strings and @aherbert while we are discussing #109 , do you think that is a blocker for this pull request? So far I think at least the API proposed here would be kept right? If so, this could be merged once the last comment is resolved, and then we can discuss how to organize the classes and where the sorensen-dice coefficient is calculated. I think the only thing missing is deciding on the name of the classes? Whether it should use I like the idea of having a descriptive name such as What do you think? (@ameyjadiye if you have any suggestion, feel free to chime in too 👍 [or any other person reading this 🙂 ]) |
@kinow @aherbert with nGram are we tampering the existing proved algoritham ? if its giving better results than existing algo I'm ok with that, also does someone really need it in real world examples ? Wikipedia says
not sure but are we over engineering similarities with #109 ? let me know if there is practicle use of nGram in real world ? would like to study it more. |
I'll try and summarise where we are:
This is the Possibles:
The library I found (python distance library) dealt with this by returning NaN. Others have been found that return 1 (because they are the same, even if they are the same of nothing). The current The new So this discrepancy in our own library should be resolved.
How do you split up a There are library examples for splitting a string into a set of characters (python distance library) and for using bigrams. I note that @kinow example, the python
This problem is solved by #109 which allows the user to define how to split the sequence with a function. The current The new Both use sets. So this discrepancy should be resolved. I think the solution is to support an n-gram size argument for the measure with default of 1. Then support a
This is similar to point 1 where you are actually scoring 0/0 again since you have no bigrams.
Currently this just throws an exception. However So if the library decides to not support Currently the stance in the Jaccard is it is a programming error to pass around Discussion To consolidate 0, 1 and 4 perhaps we can change the library to state that:
Then support n-gram size and both the This involves some decisions. My vote is:
As long as the Javadoc explains the chosen functionality then the I would be happy with any of the options. But my preference would be for equality being similar:
This is in-line with other similarity measures in the library which state equal sequences are perfectly similar. Implementing a more generic approach can be done using methods in #109. So perhaps hold this PR until that is resolved. Alex |
Will have time throughout this week to review with more calm (Sunday evening here now). But I suspect it would be easier to wait for the other pull request to be merged first, then simply rebase your code and update it? |
src/main/java/org/apache/commons/text/similarity/SorensenDiceSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDiceSimilarity.java
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDiceSimilarity.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDiceSimilarity.java
Show resolved
Hide resolved
/** | ||
* Converter class for creating Bigrams for SorensenDice similarity. | ||
*/ | ||
private static class SorensenDiceConverter implements Function<CharSequence, Collection<Integer>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be a singleton. No need to instantiate it for each use.
private static final SorensenDiceConverter INSTANCE = new SorensenDiceConverter()
But this functionality could also be delivered by using a method reference. It is also not related to SorensenDice and should be in a package level utils class something like:
CharSequenceConverterUtils ...
List<Character> toCharacterList(CharSequence cs);
List<Integer> toBigramList(CharSequence cs);
List<String> toNGramList(CharSequence cs, int n);
}); | ||
} | ||
|
||
public static double round(double value, int precision) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to be public. Why are you rounding anyway? A test like this can be very precise.
/** | ||
* For shifting bigrams to fit in single integer. | ||
*/ | ||
private static final int SHIFT_NUMBER = 16; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not part of this class but the internal converter class.
Overall this class now works nicely. But I think that my initial reservations have not been met. There is no reason that this should use bigrams, or allow duplicate bigrams. If this goes through then you end up with a library that can compute the Jaccard with unique single characters (by using a
It really does not make sense to compute one with a different method from the other. I would actually state we have:
Where the number is the number of characters (n-gram), U is for unique (non-duplicates), and D is for duplicates. We actually have possibilities for the following in the library:
Or generically: [JS]N[UD] Putting in an S2D metric to complement a J1U metric makes a confusing and inconsistent library. BTW. I believe bigrams is better and duplicates is better. So this is the best algorithm. It just totally contradicts the current Jaccard implementation. A cleaner design is to have both metrics support single or bigrams (or even n-grams) and also allow or prevent duplicates via constructor arguments. The current Jaccard can be modified to do so by allowing the default constructor to default to single, non-duplicate mode and overloaded constructors added to it. This idea may be best achieved by pushing shared functionality into a
This can be made more generic if necessary. But this then leads to the library that @kinow identified (Simetrics) which can support tokenisation of input, case handling, splitting to n-grams, handling duplicates, extended unicode character handling, etc. This level of generalisation is out of scope for the current similarity package. But n-grams and duplicates should be in scope because this is the difference between this algorithm for S2D and the current library J1U algorithm. @ameyjadiye Would you like to add support for n-grams and unique/duplicate matching to this or attempt to add a package scope class that can be shared with the JaccardSimilarity? |
@aherbert sounds a good idea, let me create another PR for that, will comeback on this. |
@aherbert apologies I was busy with much other stuff, will have to go through this. this PR is real mess. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some minor errors in the javadoc
src/main/java/org/apache/commons/text/similarity/SorensenDiceSimilarity.java
Show resolved
Hide resolved
src/main/java/org/apache/commons/text/similarity/SorensenDiceSimilarity.java
Show resolved
Hide resolved
@ameyjadiye whenever you have time to update this PR, I will have another look and try to review/merge it 👍 thanks! |
Brilliant @ameyjadiye , but if you are busy having fun with commons-graph, we can leave this one for later. Whenever you have time 👍 |
Hi All. Where are we here? |
@garydgregory - Long pending issue, its development lost in discussions. I'll go through it once again and try to finish. thanks for reminding for one. :) |
Howdy folx, Do we need a net new PR here and new owner? Willing to cut a new PR with this code and shepherd it through another round of review. |
It has been a long time since I looked at this. Browsing this PR shows that I had a lot of reservations about what score should be implemented given there are many variants, and the API to support it. I don't think the current code is ready to merge or it would have been done; it stalled because it was a work in progress. What are you proposing to move this forward? |
if the major issue is on the API definition I'm fine picking this up and making the required updates, but if larger issue is returning an unexpected score/which variant should be implemented then I'm a little bit more at a loss. What is the standard approach in those situations, i.e. where there are multiple similar implementations that may result in unexpected results? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a good way forward would be to wait a few days (one two weeks?) to see if OP replies, otherwise I'd say anyone could just address the pending points first (latest reviews feedback) and we can then discuss whether the score/algo needs some rework before being ready to be merged. If noone volunteers I can at least try to solve the feedback in a separate commit in two weeks.
That sounds reasonable to me. |
Closing as will move work to #621 (see comment about code deleted from source branch there). |
This pull request adds the Sorensen-Dice Similarity Algorithm to Apache Commons Text.
The Sorensen–Dice coefficient is a statistic used for comparing the similarity of two samples. It was independently developed by the botanists Thorvald Sorensen and Lee Raymond Dice who published in 1948 and 1945 respectively.